Amazon SageMaker Debugger
- Paper: https://assets.amazon.science/0b/cb/47bb9a1e4b6a8f78ed7a7611f4a7/amazon-sagemaker-debugger-a-system-for-real-time-insights-into-machine-learning-model-training.pdf?fbclid=IwAR2p_Jxj4CJTA7ESs4_DhoTYveGyNfHLRVk3Zimfb-Vd4W_bzkNkgHpz7MM
- Sagemaker debugger
- identifies and stops underperforming jobs
- framework agnostic
- vanishing or exploding gradients
- neuron saturation
- overfitting
- "smdebug"
- Lifecycle
- data prep
- model training – iterative
- monitor and stop early in case of issues
- hyperparameter tuning
- deployment
- Sagemaker's approach
- smdebug: record and load tensors
- rules to analyze a job
- instrumentation
- pytorch forward hooks, etc.
- smdebug wraps the apis
- containers are pre-modified
- specify tensors by regex
- recorded as protobuf files (tensorboard?) – analyze with smdebug
- written tensors are analyzed by rules; CloudWatch is used to act on the results for early stopping
- scaling
- offload into separate containers
- optimizations: sampling, aggregations, save intervals
- store separately / allow for compute
- allows for handling data volume
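
The regex selection and save-interval ideas above can be sketched in plain Python. This is not the smdebug API: `select_tensors`, `should_save`, and the tensor names are hypothetical, chosen only to show how both options cut the volume of data written.

```python
import re


def select_tensors(names, patterns):
    """Return the tensor names that match at least one regex pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [n for n in names if any(c.search(n) for c in compiled)]


def should_save(step, save_interval):
    """Save only on every `save_interval`-th step (step 0 included)."""
    return step % save_interval == 0


# Hypothetical tensor names, as a framework hook might report them.
tensor_names = ["conv1.weight", "conv1.bias", "fc.weight", "gradient/fc.weight", "loss"]

# Keep only gradients and the loss, and write them every 200 steps.
selected = select_tensors(tensor_names, [r"^gradient/", r"^loss$"])
saved_steps = [s for s in range(1000) if should_save(s, save_interval=200)]
```

With these assumed inputs, two of five tensors are written on 5 of 1000 steps, which is the kind of reduction that sampling and save intervals are meant to provide.
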
- rules
- measure class imbalance
- check that inputs are normalized correctly (zero mean, unit variance)
- activation functions
- neurons suffer saturation – leading to vanishing gradients
- dying ReLU
- fixed by scaling to allow for symmetric initialization
- loss
- loss not decreasing
- overfitting
- underfitting
- tensors
- all zeros
- all small
- values unchanging
- parameter initialization – check properties
- thresholds for gradients
- compare weight updates with gradient
- overfitting: eval loss exceeds train loss at some point
- xgboost – large or shallow trees
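
A few of the rules listed above can be sketched as simple checks over recorded values. This is a minimal illustration in plain Python, not the built-in smdebug rules: the function names, default thresholds, and the use of flat lists of floats are all assumptions, and each function expects a non-empty list.

```python
def loss_not_decreasing(losses, window=3, min_delta=0.0):
    """True if the best loss in the last `window` steps improved on the
    best earlier loss by no more than `min_delta`."""
    if len(losses) <= window:
        return False  # not enough history to judge
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_before - best_recent <= min_delta


def dead_relu_fraction(activations):
    """Fraction of ReLU outputs that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)


def vanishing_gradient(gradients, threshold=1e-7):
    """True if the mean absolute gradient falls below `threshold`."""
    mean_abs = sum(abs(g) for g in gradients) / len(gradients)
    return mean_abs < threshold


def first_step_eval_exceeds_train(train_losses, eval_losses):
    """Index of the first step where eval loss > train loss, else None."""
    for step, (train, evaluation) in enumerate(zip(train_losses, eval_losses)):
        if evaluation > train:
            return step
    return None
```

Each check only reads saved values, so it can run outside the training process, which matches the out-of-band rule analysis described in these notes.
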
- Applications
- additional visualizations to tune the model
- iterative model pruning – stop the model before loss stops
- model understanding
- Case study
- saving tensors every 10 steps causes a 1.9x slowdown
- but saving every 200 steps keeps the slowdown within 1.2x
- highlights
- easier tensor access without modifying training
- can modify access while training is in progress
- rule analysis is out of band
- Follow up resources
- evaluate smdebug source code
- imbalance ratio (Johnson & Khoshgoftaar): size of the largest class / size of the smallest class
- saliency maps
- tensorwatch
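
The imbalance ratio noted in the follow-up list can be computed directly from label counts. A minimal sketch, assuming the ratio is the largest class count divided by the smallest, as summarized in these notes; `imbalance_ratio` and the example labels are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (non-empty input)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical dataset: 90 samples of one class, 10 of another.
labels = ["cat"] * 90 + ["dog"] * 10
ratio = imbalance_ratio(labels)
```

A ratio of 1.0 means perfectly balanced classes; larger values mean stronger imbalance.
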